Abstract

Bias-variance decomposition of the expected error, defined for regression
and classification problems, is an important tool for studying and comparing
algorithms and for finding the areas where each is best applied. Here the
decomposition is introduced for the survival analysis problem. In our
experiments, we study the bias and variance parts of the expected error for
two algorithms: the original Cox proportional hazards regression and CoxPath,
a path algorithm for $L_1$-regularized Cox regression, on a series of
increasing training sets. The experiments demonstrate that, contrary to
expectations, CoxPath does not necessarily have an advantage over Cox
regression.


1 Introduction 

For classification problems, it is well known that the bias and variance
components of the prediction error combine to influence classification in
very different ways, and that their relative importance depends on the sample
size. For small and for high-dimensional datasets, the variance of the
prediction caused by variations in the training samples makes the largest
contribution to the expected prediction error. For large datasets, bias
becomes the more important component of the error [1].

Thus, the decomposition of the expected error into bias and variance parts is
an important tool for understanding the differences between algorithms and for
finding the areas where each is best applied.

To the best of the author's knowledge, such a decomposition has not been
proposed for the survival analysis problem. Here we describe an approach to
defining the decomposition for this class of problems. On two real-life
datasets we show how regularization and the size of the training sample affect
the bias, variance and overall error of the methods.

2 Bias-Variance Decomposition



2.1 Survival analysis problem 

Survival analysis deals with datasets in which each observation has three
components: a covariate vector $x$, a positive survival time $t$ and an event
indicator $\delta$, which is equal to 1 if an event (failure) occurred, and
zero if the observation is (right) censored at time $t$.

The prediction in survival analysis is generally understood as an estimate of
an individual's risk, but the concept of risk is open to interpretation. The
commonly accepted criterion of the accuracy of risk modeling is Harrell's
concordance index [2], which measures agreement between the model's scores and
the order of the failure times. The criterion is not directly related to any
particular interpretation of the scores.
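
As a minimal sketch, the index can be computed by enumerating pairs of
observations and keeping only those whose order is known despite censoring,
that is, pairs in which the earlier time is an observed failure. The function
name and the convention that a higher score means a higher risk are assumptions
made for illustration; pairs with equal times are simply skipped here.

```python
from itertools import combinations


def concordance_index(times, events, scores):
    """Concordance index for right-censored data.

    Assumption: a higher score means a higher risk (earlier expected failure).
    """
    concordant = discordant = ties = 0
    for i, j in combinations(range(len(times)), 2):
        # Put the observation with the shorter time first.
        if times[j] < times[i]:
            i, j = j, i
        # The pair is comparable only if the earlier time is an observed failure.
        if times[i] == times[j] or not events[i]:
            continue
        if scores[i] > scores[j]:
            concordant += 1
        elif scores[i] < scores[j]:
            discordant += 1
        else:
            ties += 1  # tied scores count as half-concordant
    comparable = concordant + discordant + ties
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return (concordant + 0.5 * ties) / comparable
```

For example, `concordance_index([1, 2, 3], [1, 0, 1], [2, 3, 1])` uses only the
two pairs that start with the first observation, because the second
observation is censored and cannot be ordered against the third.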

Because of the presence of censored observations, failure times define only a
partial order on observations. Two observations $c_1 = \{x_1, t_1, \delta_1\}$
and $c_2 = \{x_2, t_2, \delta_2\}$ are ordered $c_1 \prec c_2$ if and only if
$t_1 < t_2$ and $\delta_1 = 1$. In the absence of ties, the concordance index
equals the proportion of correctly ordered (concordant) pairs of observations;
in general, with tied pairs counted as half-concordant, it is

$$CI = \frac{\text{concordant} + 0.5 \cdot \text{ties}}{\text{discordant} + \text{concordant} + \text{ties}}.$$

Survival analysis can then be considered as the problem of discerning between
concordant and discordant pairs of observations. If the features are
continuous, ties are rare, and the proportion of concordant pairs closely
approximates the concordance index. This allows us to study the bias-variance
decomposition for survival analysis using the available bias-variance
decomposition for classification problems.

2.2 Bias-variance decomposition for classification problems

For binary classification problems, a commonly used decomposition of the
classification error E(C) into bias, variance and noise terms is proposed in
[3]. In this decomposition, $Y_H$ is the classification obtained on the
training set $H$, and $Y_F$ is the actual class value. The bias is a measure
of closeness between the distribution of the values $Y_H(x)$ over the training
sets $H$ of a fixed size and the distribution of $Y_F(x)$; $\sigma_x^2$
represents the level of noise in the class variable; and
$\mathrm{variance}_x$ is an estimate of the variability of the decisions over
the training sets of the given size.
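
For orientation, one widely used decomposition with exactly these ingredients
is the Kohavi-Wolpert decomposition for zero-one loss. That it coincides with
the form given in [3] is an assumption here, as is the notation: $Y_H$ is the
prediction of the model trained on the set $H$, $Y_F$ is the actual class, and
$P(x)$ is the weight of the test point $x$. Under that assumption the terms
can be written as:

```latex
\begin{align}
E(C) &= \sum_x P(x)\left(\sigma_x^2 + \mathrm{bias}_x^2 + \mathrm{variance}_x\right),\\
\mathrm{bias}_x^2 &= \tfrac12 \sum_{y}\bigl[P(Y_F = y \mid x) - P(Y_H = y \mid x)\bigr]^2,\\
\mathrm{variance}_x &= \tfrac12\Bigl(1 - \sum_{y} P(Y_H = y \mid x)^2\Bigr),\\
\sigma_x^2 &= \tfrac12\Bigl(1 - \sum_{y} P(Y_F = y \mid x)^2\Bigr).
\end{align}
```

In this form the variance term depends only on the spread of the decisions
across training sets, so it can be estimated without knowing the true class
distribution, while the bias and noise terms cannot be separated without it.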

The more sensitive a learning algorithm is to the data, the less bias it has.
The notion of the bias-variance tradeoff [1] refers to the fact that the lower
the algorithm's bias, the stronger the dependence of the learned function on
the training set, especially when the number of training cases is small or the
dimensionality of the data is high.

Bias and variance of algorithms depend on the size of the training set, com- 
plexity of the learned function, and many other factors, including specifics of 
particular data. Here we explore these components for two algorithms on two 
real datasets. 

3 The algorithms under comparison 

In the traditional approach associated with Sir David Cox [4], the research is
concentrated on a time-dependent "hazard function" $\lambda(x, t)$: the event
rate at time $t$ conditional on survival of the individual $x$ until time $t$
or later (that is, $T \ge t$).

Cox proportional hazards (PH) regression is based on the strong assumption
that the hazard function has the form

$$\lambda(x, t) = \lambda(t) \cdot \exp(\beta(x)),$$

where $\lambda(t)$ is an unknown time-dependent function, common for all
individuals in the population. The assumption implies, in particular, that
the hazards of any two individuals are proportional at all times. So the
result of the modeling is actually not the individual time-dependent hazard
functions, but rather these "proportionality" scores.
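
The proportionality can be checked in one line from the assumed form alone.
Writing $\lambda$ for the hazard and $\beta(x)$ for the score of an individual,
the common time-dependent factor cancels for any two individuals $x_1$ and
$x_2$:

```latex
\frac{\lambda(x_1, t)}{\lambda(x_2, t)}
  = \frac{\lambda(t)\exp(\beta(x_1))}{\lambda(t)\exp(\beta(x_2))}
  = \exp\bigl(\beta(x_1) - \beta(x_2)\bigr)
```

The ratio does not depend on $t$, so ranking individuals by $\beta(x)$ ranks
them by hazard at every time point.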

Most advanced methods for prediction in survival analysis were developed to
make this traditional approach more robust against overfitting on sparse data
(see surveys in 012]). Among the regularization methods, $L_1$-penalized Cox
regression is the most attractive because it produces concise, interpretable
rules. CoxPath [7] is a path algorithm that builds $L_1$-regularized
proportional hazards regression models for a series of values of the
regularization parameter $\lambda$ and then selects one of the solutions based
on a performance criterion. Regularization lowers the variance of the
algorithm, making an individual regression model more robust against
variations between small training sets. Selection of the best model is
intended to improve the bias of the algorithm. According to the bias-variance
tradeoff, neither step necessarily decreases the overall prediction error. In
the next section we describe the results of the experiments evaluating the
bias and variance of these algorithms on series of increasing training sets.
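
The kind of objective that an $L_1$-regularized Cox model minimizes for one
fixed value of the regularization parameter can be sketched as follows. This
is not the CoxPath implementation, which follows the whole solution path. The
function name, the linear score `X @ beta`, and the assumption of no tied
failure times are illustrative choices not taken from the text.

```python
import numpy as np


def l1_penalized_cox_loss(beta, X, times, events, lam):
    """Negative log partial likelihood of a linear Cox model plus an L1 penalty.

    Assumptions: the score is X @ beta, there are no tied failure times, and
    the risk set of observation i is {j : t_j >= t_i}.
    """
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    scores = X @ beta
    loss = 0.0
    for i in np.flatnonzero(events):
        at_risk = times >= times[i]  # individuals still under observation
        # Numerically stable log of the sum of exp(score) over the risk set.
        log_denominator = np.logaddexp.reduce(scores[at_risk])
        loss -= scores[i] - log_denominator
    # The L1 penalty drives some coefficients exactly to zero as lam grows.
    return loss + lam * np.abs(beta).sum()
```

Evaluating this loss over a grid of `lam` values, each with its own minimizing
`beta`, gives a family of models from which one is then selected.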






4 Computational Experiments 



For real-life datasets, only the variance and the prediction error can be
measured directly. The sum of the bias and the noise term $\sigma^2$ (the
unavoidable error) was evaluated as the difference between the prediction
error and the variance.
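
A minimal sketch of this bookkeeping is given below. It assumes that the
decisions of the models trained on different training sets are stored as a 0/1
matrix over the same test cases, and that the variance is taken in the
Kohavi-Wolpert sense. The encoding of the decisions and the function name
`variance_and_residual` are illustrative assumptions, not the author's code.

```python
import numpy as np


def variance_and_residual(decisions, truth):
    """Estimate error, variance and their difference (bias plus noise).

    decisions: (n_models, n_cases) array of 0/1 decisions, one row per
               training set, all made on the same test cases.
    truth:     (n_cases,) array of the actual 0/1 classes.
    """
    decisions = np.asarray(decisions, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Zero-one error averaged over training sets and test cases.
    error = np.mean(decisions != truth)
    # Share of training sets that decide for class 1 on each test case.
    p1 = decisions.mean(axis=0)
    # Assumed variance: 0.5 * (1 - sum_y P(Y_H = y | x)^2), averaged over cases.
    variance = np.mean(0.5 * (1.0 - p1**2 - (1.0 - p1) ** 2))
    return error, variance, error - variance
```

With two models whose decisions are `[[1, 0], [1, 1]]` and true classes
`[1, 0]`, the models disagree only on the second case, so all of the variance
comes from that case.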

In the experiments, 20% of the whole sample was first set aside as a test set.
From the rest, subsets of increasing size were randomly selected as training
sets; 20 training sets of each size were selected. All the methods were
trained on the training sets, and the models were applied to the single test
set to evaluate the variance on the test data. The procedure was repeated 10
times with randomly chosen test sets, and the average variance and performance
for each training set size were evaluated across all 10 test sets.
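
The resampling scheme can be sketched as an index generator. The text does not
say whether training sets of different sizes were nested, so this sketch draws
each training set independently and without replacement. The function name,
the seed and the rounding of the test set size are assumptions.

```python
import numpy as np


def resampling_plan(n_obs, train_sizes, n_test_splits=10, n_train_sets=20,
                    test_fraction=0.2, seed=0):
    """Yield (test_idx, size, train_idx) index triples for the protocol."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n_obs))
    for _split in range(n_test_splits):
        # A fresh random test set; the remaining observations form the pool.
        perm = rng.permutation(n_obs)
        test_idx, pool = perm[:n_test], perm[n_test:]
        for size in train_sizes:
            for _rep in range(n_train_sets):
                # Training sets never overlap with the current test set.
                train_idx = rng.choice(pool, size=size, replace=False)
                yield test_idx, size, train_idx
```

Each yielded triple corresponds to one model fit; grouping the fits by test
set and training set size gives the 20 sets of decisions from which the
variance for that size is estimated.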

The experiment was conducted on two datasets. 

• PBC: These data are from the Mayo Clinic trial in primary biliary cirrhosis
of the liver conducted between 1974 and 1984 [5]. Patients are characterized
by a standard description of the disease conditions. The dataset has 17
features and 228 observations.

• Ro02s: the dataset from [9] contains information about 240 patients with
lymphoma. Using hierarchical cluster analysis on the whole dataset and
expert knowledge about factors associated with disease progression, the
authors identified four relevant clusters and a single gene out of the 7399
genes on the lymphochip. Along with gene expressions, the data include
two features for histological grouping of the patients. The authors
aggregated gene expressions in each selected cluster to create signatures
of the clusters. The signatures, rather than the gene expressions themselves,
were used for modeling. The dataset with aggregated data has 7 features.

The results are presented in Figures 1 and 2, where we show the performance of
the methods as 1 - E(C). The bias of the two methods is almost
indistinguishable on both datasets and is not shown here. The figures show
that CoxPath does not have a consistent advantage over Cox PH regression. On
the PBC dataset, CoxPath has lower variance and better performance for all
sample sizes, while for the Ro02s dataset the opposite is true.

One can hypothesize that the advantage in variance that CoxPath obtains from
regularization was offset by the additional sensitivity to the training data
introduced by model selection.

Additional experiments with artificial datasets and with $L_1$-regularized Cox
PH regression with a fixed parameter $\lambda$ may help to better understand
the factors affecting the bias and variance of the methods and to produce
recommendations on the types of data for which one or the other method is
preferable.

Figure 1: Ro02s dataset, Variance and Performance of Cox PHR and CoxPath

[Figure panels: Performance depending on sample size; Variance depending on sample size]

Figure 2: PBC dataset, Variance and Performance of Cox PHR and CoxPath